Assessing Model Fit with Synthetic vs. Real Data

Authors

  • Behzad Beheshti
  • Michel C. Desmarais
Abstract

Assessing whether a model is a good fit to the data is nontrivial. The standard practice is to compare a few machine learning techniques for learning a model from data and to pick the one with the highest predictive performance; the winner is considered the best-fitting model. But each model may involve different machine learning algorithms, each carrying its own set of parameters and imposing its own constraints on the corresponding model. This results in a large space in which to explore model performance, and the actual best-fitting model may be overlooked because of an unfortunate choice of algorithm parameters. We address this issue by complementing model performance comparison with a method that combines real data with synthetic data generated from the competing models. We naturally expect a model to perform best over its own synthetic data, but the analysis across the other synthetic data sets provides some indication of the model's generality and robustness under different assumptions about the data. Results of our investigation in the domain of educational data mining show that a model's performance is, as expected, best when tested over synthetic data generated from that model. But we observe much greater performance contrasts across synthetic data than across real data. The performance pattern of the models over a given synthetic data set forms a kind of “signature”. We discuss the significance of this signature for assessing model fit, and whether it can provide cues to the data's underlying ground truth.
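
The cross-evaluation idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two toy models (per-item and per-student success rates), the held-out accuracy metric, the placeholder "real" data, and every function name below are hypothetical and introduced only for this example.

```python
import math
import random

def item_model(data, held_out=frozenset()):
    """Toy model (assumption): P(correct) is the item's observed success rate."""
    rates = []
    for i in range(len(data[0])):
        vals = [row[i] for s, row in enumerate(data) if (s, i) not in held_out]
        rates.append(sum(vals) / len(vals) if vals else 0.5)
    return lambda s, i: rates[i]

def student_model(data, held_out=frozenset()):
    """Toy model (assumption): P(correct) is the student's observed success rate."""
    rates = []
    for s, row in enumerate(data):
        vals = [v for i, v in enumerate(row) if (s, i) not in held_out]
        rates.append(sum(vals) / len(vals) if vals else 0.5)
    return lambda s, i: rates[s]

MODELS = {"item": item_model, "student": student_model}

def generate(model, data, rng):
    """Fit a model to data, then sample a synthetic data set of the same shape."""
    prob = model(data)
    return [[int(rng.random() < prob(s, i)) for i in range(len(row))]
            for s, row in enumerate(data)]

def accuracy(model, data, rng, frac=0.2):
    """Hide a random fraction of cells, fit on the rest, score the hidden cells."""
    cells = [(s, i) for s in range(len(data)) for i in range(len(data[0]))]
    held_out = frozenset(rng.sample(cells, int(frac * len(cells))))
    prob = model(data, held_out)
    hits = sum((prob(s, i) >= 0.5) == bool(data[s][i]) for s, i in held_out)
    return hits / len(held_out)

def signature(data, rng):
    """Performance of every candidate model over one data set."""
    return {name: accuracy(m, data, rng) for name, m in MODELS.items()}

def signatures(real, seed=0):
    """Signature of the real data plus one per model-generated synthetic set."""
    rng = random.Random(seed)
    out = {"real": signature(real, rng)}
    for name, model in MODELS.items():
        out["synthetic:" + name] = signature(generate(model, real, rng), rng)
    return out

if __name__ == "__main__":
    rng = random.Random(42)
    # Placeholder "real" data (30 students x 12 items), not data from the paper.
    ability = [rng.gauss(0, 1) for _ in range(30)]
    difficulty = [rng.gauss(0, 1) for _ in range(12)]
    real = [[int(rng.random() < 1 / (1 + math.exp(b - a))) for b in difficulty]
            for a in ability]
    for name, sig in signatures(real).items():
        print(name, sig)
```

Each row of the printed output is one signature. Comparing the "real" row with the synthetic rows is the kind of inspection the abstract describes, though the paper's actual models and metrics may differ from this sketch.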

Similar resources

Goodness of Fit of Skills Assessment Approaches: Insights from Patterns of Real vs. Synthetic Data Sets

This study investigates the issue of the goodness of fit of different skills assessment models using both synthetic and real data. Synthetic data is generated from the different skills assessment models. The results show wide differences in performance between the skills assessment models over synthetic data sets. The set of relative performances for the different models creates a kind of “sign...

On the Canonical-Based Goodness-of-fit Tests for Multivariate Skew-Normality

It is well-known that the skew-normal distribution can provide an alternative model to the normal distribution for analyzing asymmetric data. The aim of this paper is to propose two goodness-of-fit tests for assessing whether a sample comes from a multivariate skew-normal (MSN) distribution. We address the problem of multivariate skew-normality goodness-of-fit based on the empirical Laplace tra...

Using of frailty model baseline proportional hazard rate in Real Data Analysis

Many populations encountered in survival analysis are often not homogeneous. Individuals are flexible in their susceptibility to causes of death, response to treatment and influence of various risk factors. Ignoring this heterogeneity can result in misleading conclusions. To deal with these problems, the proportional hazard frailty model was introduced. In this paper, the frailty model is ex...

Assessing positive matrix factorization model fit: a new method to estimate uncertainty and bias in factor contributions at the measurement time scale

A Positive Matrix Factorization receptor model for aerosol pollution source apportionment was fit to a synthetic dataset simulating one year of daily measurements of ambient PM2.5 concentrations, comprised of 39 chemical species from nine pollutant sources. A novel method was developed to estimate model fit uncertainty and bias at the daily time scale, as related to factor contributions. A circ...

Assessing positive matrix factorization model fit: a new method to estimate uncertainty and bias in factor contributions at the daily time scale

A Positive Matrix Factorization receptor model for aerosol pollution source apportionment was fit to a synthetic dataset simulating one year of daily measurements of ambient PM2.5 concentrations, comprised of 39 chemical species from nine pollutant sources. A novel method was developed to estimate model fit uncertainty and bias at the daily time scale, as related to factor contributions. A ba...

Publication date: 2014